Java regex: Pattern.split() with overlapping delimiters

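First, I know similar questions have been asked, for example:

How to split a string, but also keep the delimiters?

However, I'm running into a problem splitting a string with Pattern.split() when the pattern is built from a list of delimiters that sometimes overlap. Here is an example:

The goal is to split a string on a set of known codewords, each surrounded by slashes, keeping both the delimiters (the codewords) themselves and the value that follows each one (which may be an empty string).

For this example, the codewords are: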

/ABC/
/DEF/
/GHI/

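Following the thread referenced above, I use lookahead and lookbehind to tokenize the string into codewords and values, and build the pattern like this: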

((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))

Working string:

"123/ABC//DEF/456/GHI/789"

With split, this tokenizes nicely into:

"123","/ABC/","/DEF/","456","/GHI/","789"

Problem string (note the single slash between "ABC" and "DEF"):

"123/ABC/DEF/456/GHI/789"

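The expectation here is that "DEF/456" is the value following the "/ABC/" codeword, because the "DEF/" bit is not actually a codeword; it just happens to look like one.

Expected result: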

"123","/ABC/","DEF/456","/GHI/","789"

Actual result:

"123","/ABC","/","DEF/","456","/GHI/","789"

As you can see, the slash between "ABC" and "DEF" is isolated as a token of its own.
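A minimal reproduction of the behavior described above. The class and method names are made up for illustration; the pattern and both input strings are the ones from the question:

```java
import java.util.Arrays;
import java.util.List;

class OverlappingSplitRepro {
    // Pattern from the question: split immediately before and after each codeword.
    static final String PATTERN =
            "((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))";

    static List<String> tokenize(String input) {
        return Arrays.asList(input.split(PATTERN));
    }

    public static void main(String[] args) {
        // Two separate codewords: tokenizes as intended.
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
        // Shared slash: the lookahead for /DEF/ fires at the slash that closes /ABC/,
        // and the lookbehind for /ABC/ fires right after it, so the slash is cut out.
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
    }
}
```

The second call splits both before and after the shared slash, which is why "/ABC" and "/" come out as separate tokens.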

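I have tried the solutions from the other thread that use only lookahead or only lookbehind, but they all seem to have the same problem. Thanks for any help.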


3 answers

  1. # Answer 1

    If you can use find instead of split, with some non-greedy matching, try this:

    import java.util.Arrays;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class SampleJava {
        static final String[] CODEWORDS = {
            "ABC",
            "DEF",
            "GHI"};

        public static void main(String[] args) {
            String input = "/ABC/DEF/456/GHI/789";
            // Builds the delimiter alternation "/(ABC|DEF|GHI)/".
            String codewords = Arrays.stream(CODEWORDS)
                    .collect(Collectors.joining("|", "/(", ")/"));
            Pattern p = Pattern.compile(
        /* codewords */ ("(DELIM)"
        /* pre-delim */ + "|(.+?(?=DELIM))"
        /* final bit */ + "|(.+?$)").replace("DELIM", codewords));
            Matcher m = p.matcher(input);
            while (m.find()) {
                System.out.print(m.group(0));
                // Group 1 is set only when the first alternative (a codeword) matched.
                if (m.group(1) != null) {
                    System.out.print(" ← code word");
                }
                System.out.println();
            }
        }
    }
    

    Output:

    /ABC/ ← code word

    DEF/456

    /GHI/ ← code word

    789
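    The same idea can be sketched as a method that returns the tokens as a list, run here against both strings from the question. This is an illustrative variant, not part of the original answer: the class name, the method name, and the choice to write the delimiter with a non-capturing group are assumptions.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class FindTokenizerSketch {
        // Delimiter alternation, hard-coded for the three codewords in the question.
        static final String DELIM = "/(?:ABC|DEF|GHI)/";

        static final Pattern TOKEN = Pattern.compile(
                "(" + DELIM + ")"            // a codeword
                + "|.+?(?=" + DELIM + ")"    // shortest run of text up to the next codeword
                + "|.+?$");                  // trailing text after the last codeword

        static List<String> tokenize(String input) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(input);
            // Each find() resumes after the previous match, so a slash consumed
            // as part of /ABC/ can no longer start a /DEF/ match.
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }

        public static void main(String[] args) {
            System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
            System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
        }
    }
    ```

    Because find consumes characters as it goes, unlike the zero-width lookarounds passed to split, the shared slash is claimed by whichever codeword matches first.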

  2. # Answer 2

    Use a combination of positive and negative lookarounds:

    String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
    

    Using alternation inside a single lookahead/lookbehind also simplifies the pattern considerably.

    live demo
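    A small harness for trying the pattern above against both strings from the question. The class and method names are hypothetical. Note that the `....` in the first negative lookbehind appears to rely on every codeword being exactly three letters long; that is an observation about this pattern, not something the answer states.

    ```java
    import java.util.Arrays;
    import java.util.List;

    class LookaroundSplitSketch {
        static final String PATTERN =
                // Split after a codeword, unless that "codeword" starts on the
                // closing slash of a real codeword just before it.
                "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
                // Split before a codeword, unless its opening slash is the
                // closing slash of the preceding codeword.
                + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

        static List<String> tokenize(String input) {
            return Arrays.asList(input.split(PATTERN));
        }

        public static void main(String[] args) {
            System.out.println(tokenize("123/ABC//DEF/456/GHI/789"));
            System.out.println(tokenize("123/ABC/DEF/456/GHI/789"));
        }
    }
    ```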

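  3. # Answer 3

    Following some TDD principles (red-green-refactor), here is how I would implement this behavior:

    Write the specs (red)

    I defined a set of unit tests that describe how I understood your "tokenization process". If any test does not match your expectations, feel free to tell me and I will edit my answer accordingly.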

    import static org.assertj.core.api.Assertions.assertThat;
    
    import java.util.List;
    
    import org.junit.Test;
    
    public class TokenizerSpec {
    
        Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");
    
        @Test
        public void itShouldTokenizeTwoConsecutiveCodewords() {
            String input = "123/ABC//DEF/456";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
        }
    
        @Test
        public void itShouldTokenizeMisleadingCodeword() {
            String input = "123/ABC/DEF/456/GHI/789";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
        }
    
        @Test
        public void itShouldTokenizeWhenValueContainsSlash() {
            String input = "1/23/ABC/456";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
        }
    
        @Test
        public void itShouldTokenizeWithoutCodewords() {
            String input = "123/456/789";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("123/456/789");
        }
    
        @Test
        public void itShouldTokenizeWhenEndingWithCodeword() {
            String input = "123/ABC/";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("123", "/ABC/");
        }
    
        @Test
        public void itShouldTokenizeWhenStartingWithCodeword() {
            String input = "/ABC/123";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("/ABC/", "123");
        }
    
        @Test
        public void itShouldTokenizeWhenOnlyCodeword() {
            String input = "/ABC//DEF//GHI/";
    
            List<String> tokens = tokenizer.splitPreservingCodewords(input);
    
            assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
        }
    }
    

    Implement according to the specs (green)

    This class makes all of the tests above pass:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Optional;
    
    public final class Tokenizer {
    
        private final List<String> codewords;
    
        public Tokenizer(String... codewords) {
            this.codewords = Arrays.asList(codewords);
        }
    
        public List<String> splitPreservingCodewords(String input) {
            List<String> tokens = new ArrayList<>();
    
            int lastIndex = 0;
            int i = 0;
            while (i < input.length()) {
                final int idx = i;
                // startsWith(prefix, offset) checks for a codeword at idx without copying the tail.
                Optional<String> codeword = codewords.stream()
                                                     .filter(cw -> input.startsWith(cw, idx))
                                                     .findFirst();
                if (codeword.isPresent()) {
                    if (i > lastIndex) {
                        tokens.add(input.substring(lastIndex, i));
                    }
                    tokens.add(codeword.get());
                    i += codeword.get().length();
                    lastIndex = i;
                } else {
                    i++;
                }
            }
    
            if (i > lastIndex) {
                tokens.add(input.substring(lastIndex, i));
            }
    
            return tokens;
        }
    }
    

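    Improve the implementation (refactor)

    Not done yet (I don't have enough time to spend on this answer right now). If you ask, I'll gladly refactor Tokenizer (but later). :-) Or you can do it yourself, since you have the unit tests to guard against regressions.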
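    One possible direction for that refactoring, offered as a sketch rather than the answerer's own follow-up: let a Matcher do the left-to-right scan instead of the manual index loop. The class name RegexTokenizer is hypothetical, and the sketch assumes at least one non-empty codeword is supplied.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    final class RegexTokenizer {

        private final Pattern codewordPattern;

        RegexTokenizer(String... codewords) {
            // Quote each codeword so it is matched literally, then join with "|".
            String alternation = Arrays.stream(codewords)
                                       .map(Pattern::quote)
                                       .collect(Collectors.joining("|"));
            this.codewordPattern = Pattern.compile(alternation);
        }

        List<String> splitPreservingCodewords(String input) {
            List<String> tokens = new ArrayList<>();
            Matcher m = codewordPattern.matcher(input);
            int lastIndex = 0;
            while (m.find()) {
                // Text between the previous codeword and this one is a value token.
                if (m.start() > lastIndex) {
                    tokens.add(input.substring(lastIndex, m.start()));
                }
                tokens.add(m.group());
                lastIndex = m.end();
            }
            // Anything left after the last codeword is the final value token.
            if (lastIndex < input.length()) {
                tokens.add(input.substring(lastIndex));
            }
            return tokens;
        }
    }
    ```

    The method keeps the same name and return type as Tokenizer, so the specs above could be pointed at it by changing only the constructed class.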